MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Merging of multi-string BWTs with applications.

Internal identifier: 000E85 (Ncbi/Merge); previous: 000E84; next: 000E86

Authors: James Holt [United States]; Leonard McMillan [United States]

Source: Bioinformatics (Oxford, England), 2014; 30(24): 3524-31

RBID: pubmed:25172922

Abstract

The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.
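
As a rough illustration of the idea in the abstract (and not the authors' implementation), the sketch below builds a multi-string BWT naively by sorting every rotation of every read, then counts k-mer occurrences with FM-index backward search. The function names `multi_string_bwt` and `count_kmer`, the `$` terminator, and the toy reads are assumptions made for this example; practical tools rely on far more memory-efficient construction and rank structures.

```python
from collections import Counter


def multi_string_bwt(reads):
    """Naive multi-string BWT: sort all rotations of each read + '$'."""
    rotations = []
    for read in reads:
        text = read + "$"  # '$' sorts before A, C, G, T in ASCII
        for i in range(len(text)):
            rotations.append(text[i:] + text[:i])
    rotations.sort()
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_kmer(bwt, kmer):
    """Count occurrences of kmer (no '$') using backward search."""
    counts = Counter(bwt)
    # first[c] = number of symbols in the BWT that sort before c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)
    for c in reversed(kmer):
        if c not in first:
            return 0
        # Naive rank: count c in the prefix; real indexes precompute this.
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


reads = ["ACGT", "CGTA", "ACGA"]  # toy data, assumed for illustration
bwt = multi_string_bwt(reads)
```

On these toy reads, `count_kmer(bwt, "CG")` returns 3 (one hit per read) and `count_kmer(bwt, "TT")` returns 0, without ever scanning the reads themselves.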

DOI: 10.1093/bioinformatics/btu584
PubMed: 25172922

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Merging of multi-string BWTs with applications.</title>
<author>
<name sortKey="Holt, James" sort="Holt, James" uniqKey="Holt J" first="James" last="Holt">James Holt</name>
<affiliation wicri:level="2">
<nlm:affiliation>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599</wicri:regionArea>
<placeName>
<region type="state">Caroline du Nord</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Mcmillan, Leonard" sort="Mcmillan, Leonard" uniqKey="Mcmillan L" first="Leonard" last="Mcmillan">Leonard Mcmillan</name>
<affiliation wicri:level="2">
<nlm:affiliation>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599</wicri:regionArea>
<placeName>
<region type="state">Caroline du Nord</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2014">2014</date>
<idno type="RBID">pubmed:25172922</idno>
<idno type="pmid">25172922</idno>
<idno type="doi">10.1093/bioinformatics/btu584</idno>
<idno type="wicri:Area/PubMed/Corpus">001864</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001864</idno>
<idno type="wicri:Area/PubMed/Curation">001864</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001864</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001829</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001829</idno>
<idno type="wicri:Area/Ncbi/Merge">000E85</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Merging of multi-string BWTs with applications.</title>
<author>
<name sortKey="Holt, James" sort="Holt, James" uniqKey="Holt J" first="James" last="Holt">James Holt</name>
<affiliation wicri:level="2">
<nlm:affiliation>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599</wicri:regionArea>
<placeName>
<region type="state">Caroline du Nord</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Mcmillan, Leonard" sort="Mcmillan, Leonard" uniqKey="Mcmillan L" first="Leonard" last="Mcmillan">Leonard Mcmillan</name>
<affiliation wicri:level="2">
<nlm:affiliation>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599</wicri:regionArea>
<placeName>
<region type="state">Caroline du Nord</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics (Oxford, England)</title>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2014" type="published">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Data Compression</term>
<term>Genomics (methods)</term>
<term>High-Throughput Nucleotide Sequencing (methods)</term>
<term>Mice</term>
<term>Sequence Alignment</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Animaux</term>
<term>Compression de données</term>
<term>Génomique (méthodes)</term>
<term>Souris</term>
<term>Séquençage nucléotidique à haut débit (méthodes)</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Genomics</term>
<term>High-Throughput Nucleotide Sequencing</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Data Compression</term>
<term>Mice</term>
<term>Sequence Alignment</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Animaux</term>
<term>Compression de données</term>
<term>Génomique</term>
<term>Souris</term>
<term>Séquençage nucléotidique à haut débit</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The throughput of genomic sequencing has increased to the point that is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">25172922</PMID>
<DateCompleted>
<Year>2015</Year>
<Month>03</Month>
<Day>05</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>12</Month>
<Day>02</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1367-4811</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>30</Volume>
<Issue>24</Issue>
<PubDate>
<Year>2014</Year>
<Month>Dec</Month>
<Day>15</Day>
</PubDate>
</JournalIssue>
<Title>Bioinformatics (Oxford, England)</Title>
<ISOAbbreviation>Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>Merging of multi-string BWTs with applications.</ArticleTitle>
<Pagination>
<MedlinePgn>3524-31</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1093/bioinformatics/btu584</ELocationID>
<Abstract>
<AbstractText Label="MOTIVATION" NlmCategory="BACKGROUND">The throughput of genomic sequencing has increased to the point that is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">We present a novel algorithm that merges multi-string BWTs in [Formula: see text] time where LCS is the length of their longest common substring between any of the inputs, and N is the total length of all inputs combined (number of symbols) using [Formula: see text] bits where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.</AbstractText>
<CopyrightInformation>© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.</CopyrightInformation>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Holt</LastName>
<ForeName>James</ForeName>
<Initials>J</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>McMillan</LastName>
<ForeName>Leonard</ForeName>
<Initials>L</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<GrantList CompleteYN="Y">
<Grant>
<GrantID>P50 HG006582</GrantID>
<Acronym>HG</Acronym>
<Agency>NHGRI NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>P50GM076468</GrantID>
<Acronym>GM</Acronym>
<Agency>NIGMS NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>P50 MH090338</GrantID>
<Acronym>MH</Acronym>
<Agency>NIMH NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>P50HG006582</GrantID>
<Acronym>HG</Acronym>
<Agency>NHGRI NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>P50MH090338</GrantID>
<Acronym>MH</Acronym>
<Agency>NIMH NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>P50 GM076468</GrantID>
<Acronym>GM</Acronym>
<Agency>NIGMS NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D052061">Research Support, N.I.H., Extramural</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2014</Year>
<Month>08</Month>
<Day>28</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>Bioinformatics</MedlineTA>
<NlmUniqueID>9808944</NlmUniqueID>
<ISSNLinking>1367-4803</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D044962" MajorTopicYN="N">Data Compression</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D023281" MajorTopicYN="N">Genomics</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="N">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D059014" MajorTopicYN="N">High-Throughput Nucleotide Sequencing</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D051379" MajorTopicYN="N">Mice</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016415" MajorTopicYN="N">Sequence Alignment</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2014</Year>
<Month>8</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2014</Year>
<Month>8</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2015</Year>
<Month>3</Month>
<Day>7</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">25172922</ArticleId>
<ArticleId IdType="pii">btu584</ArticleId>
<ArticleId IdType="doi">10.1093/bioinformatics/btu584</ArticleId>
<ArticleId IdType="pmc">PMC4318930</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Bioinformatics. 2012 Jun 1;28(11):1415-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22556365</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2008 May;18(5):810-20</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18340039</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2002 Apr;12(4):656-64</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11932250</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2010 Jun 15;26(12):i367-73</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20529929</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Rev Genet. 2014 Jan;15(1):56-62</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24322726</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2008 May;18(5):821-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18349386</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Algorithms Mol Biol. 2014 Feb 24;9(1):2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24565280</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Bacteriol. 2008 Oct;190(20):6881-93</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18676672</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2012 Mar;22(3):549-56</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22156294</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jan 1;30(1):24-30</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23661694</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2009 Jun;19(6):1117-23</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19251739</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Mol Biol. 1990 Oct 5;215(3):403-10</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">2231712</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Science. 2011 Feb 11;331(6018):728-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21311016</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2012 Jul;30(7):627-30</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22781691</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2009;10(3):R25</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19261174</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11504945</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2009 Jul 15;25(14):1754-60</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19451168</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Caroline du Nord</li>
</region>
</list>
<tree>
<country name="États-Unis">
<region name="Caroline du Nord">
<name sortKey="Holt, James" sort="Holt, James" uniqKey="Holt J" first="James" last="Holt">James Holt</name>
</region>
<name sortKey="Mcmillan, Leonard" sort="Mcmillan, Leonard" uniqKey="Mcmillan L" first="Leonard" last="Mcmillan">Leonard Mcmillan</name>
</country>
</tree>
</affiliations>
</record>
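
The `<pubmed>` part of the record above can also be read outside Dilib. The sketch below is only an illustration: it uses Python's standard `xml.etree.ElementTree` on a trimmed copy of that subtree, and the helper name `summarize` is an assumption. It also collapses grant IDs that differ only by whitespace, since the record lists each grant twice (for example `P50 HG006582` and `P50HG006582`). The TEI part uses `wicri:` and `nlm:` prefixes with no namespace declaration in the record as shown, so a strict parser would likely reject it unless declarations are added.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <pubmed> subtree from the record (illustrative only).
PUBMED_XML = """
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">25172922</PMID>
<Article PubModel="Print-Electronic">
<ArticleTitle>Merging of multi-string BWTs with applications.</ArticleTitle>
<GrantList CompleteYN="Y">
<Grant><GrantID>P50 HG006582</GrantID></Grant>
<Grant><GrantID>P50GM076468</GrantID></Grant>
<Grant><GrantID>P50 MH090338</GrantID></Grant>
<Grant><GrantID>P50HG006582</GrantID></Grant>
<Grant><GrantID>P50MH090338</GrantID></Grant>
<Grant><GrantID>P50 GM076468</GrantID></Grant>
</GrantList>
</Article>
</MedlineCitation>
</pubmed>
"""


def summarize(xml_text):
    """Return PMID, title, and de-duplicated grant IDs from a <pubmed> tree."""
    root = ET.fromstring(xml_text.strip())
    # Remove inner spaces so 'P50 HG006582' and 'P50HG006582' compare equal.
    grants = sorted({g.text.replace(" ", "") for g in root.iter("GrantID")})
    return {
        "pmid": root.findtext(".//PMID"),
        "title": root.findtext(".//ArticleTitle"),
        "grants": grants,
    }


summary = summarize(PUBMED_XML)
```

Here `summary["grants"]` holds three entries rather than six, which may be closer to what a reader expects from the funding list.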

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Ncbi/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000E85 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd -nk 000E85 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Ncbi
   |étape=   Merge
   |type=    RBID
   |clé=     pubmed:25172922
   |texte=   Merging of multi-string BWTs with applications.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/RBID.i   -Sk "pubmed:25172922" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021